Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Authors

  • Alex Gittens
  • Aditya Devarakonda
  • Evan Racah
  • Michael F. Ringenburg
  • Lisa Gerhardt
  • Jey Kottalam
  • Jialin Liu
  • Kristyn J. Maschhoff
  • Shane Canon
  • Jatin Chhugani
  • Pramod Sharma
  • Jiyan Yang
  • James Demmel
  • Jim Harrell
  • Venkat Krishnamurthy
  • Michael W. Mahoney
  • Prabhat
Abstract

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently into Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
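
As a rough illustration of why a tall-and-skinny shape suits a data-parallel model, here is a minimal sketch; it is not the paper's implementation. For an m x n matrix A with m much larger than n, the small n x n Gram matrix A^T A equals the sum of the Gram matrices of its row blocks, so each partition can be processed independently and only n x n partial results need to be combined. The function names `block_gram`, `add_matrices` and `gram_by_blocks` are hypothetical, and plain Python lists stand in for a distributed, row-partitioned matrix.

```python
from functools import reduce


def block_gram(rows):
    """Return block^T @ block for one row block, as an n x n list of lists."""
    n = len(rows[0])
    gram = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i in range(n):
            for j in range(n):
                gram[i][j] += row[i] * row[j]
    return gram


def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gram_by_blocks(matrix, block_size):
    """Compute A^T A by summing per-block Gram matrices (map, then reduce)."""
    # Assumption: contiguous row slices play the role of data partitions.
    blocks = [matrix[i:i + block_size] for i in range(0, len(matrix), block_size)]
    partials = map(block_gram, blocks)      # independent per-partition work
    return reduce(add_matrices, partials)   # combine small n x n results


# A small 6 x 2 "tall-and-skinny" example matrix.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
     [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
G = gram_by_blocks(A, block_size=2)
```

In this sketch the combined result `G` matches `block_gram(A)` computed on the whole matrix at once, and the size of the combined data depends only on the number of columns, not on the number of rows.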

Similar resources

Bridging the Gap between HPC and Big Data frameworks

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retainin...

A new approach for building recommender system using non negative matrix factorization method

Nonnegative Matrix Factorization is a new approach to reduce data dimensions. In this method, by applying the nonnegativity of the matrix data, the matrix is decomposed into components that are more interrelated and divide the data into sections where the data in these sections have a specific relationship. In this paper, we use the nonnegative matrix factorization to decompose the user ratin...

Fabrication of Nanostructured Cu matrix Nanocomposites by High Energy Mechanical Milling and Spark Plasma Sintering

Spark plasma sintering (SPS) is a sintering process that is capable of sintering hard worked powders in short times. This technique was used to fabricate bulk Cu and Cu-SiC nanocomposites. Pure Cu and mixed powders of Cu including 4 vol% of SiC nanoparticles were mechanically alloyed for 25 h and sintered at 750˚C under vacuum condition by SPS method. Microstructures of the materials were chara...

B?J/?(?,K) Decays within QCD Factorization Approach

We used QCD factorization for the hadronic matrix elements to show that the existing data, in particular the branching ratios BR ( ?J/?K) and BR ( ?J/??), can be accounted for by this approach. We analyzed the decay within the framework of QCD factorization. We have completed the calculation of the relevant hard-scattering kernels for twist-2 and twist-3. We calculated these decays in a special scale ...

Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink

With the ever-increasing need to analyze large amounts of data to get useful insights, it is essential to develop complex parallel machine learning algorithms that can scale with data and number of parallel processes. These algorithms need to run on large data sets and to execute in minimal time in order to extract useful information in a time-constrained environment. MPI...

Journal:
  • CoRR

Volume: abs/1607.01335

Publication date: 2016